    
"""
Purpose: Benchmark different BERT (Bidirectional Encoder Representations from Transformers) variants as text-embedding tools, based on the performance of downstream machine learning classification/prediction models.

Background:
   (1) 923 pairs of drug interventional clinical trials (abbreviated below as 'clinical trials', 'drug trials' or simply 'trials') were obtained through a series of analyses. Each pair consists of a successful Phase 2 drug interventional clinical trial and the associated Phase 3 clinical trial of the same drug, which was launched after the success of that Phase 2 trial.

   (2) From the datasets and records of drug interventional clinical trials, the "Study Description" texts of the Phase 2 clinical trials were extracted as one of many features for predicting the success of clinical trials.

   (3) These texts were embedded (vectorized) using different BERT-based models: BioBERT, Science BERT, Sentence BERT, PubMed BERT and Clinical BERT.

   (4) Using these vectors, machine learning models were trained, validated and benchmarked to determine (a) whether the "Study Description" text of a Phase 2 clinical trial can be used to predict the outcome (success or failure) of the associated Phase 3 clinical trial, and (b) which embedding tool, i.e., which BERT variant, works best for this prediction task.


Author : H. Lin, Ph.D., https://orcid.org/0000-0003-4060-7336 

This script contains Python code.

Version: Created on 2nd May 2025; last updated on 29th Aug. 2025 (latest)

PS: Details of BERT and its variants are not covered here. Briefly, Clinical BERT, for example, is a pre-trained (not fine-tuned) language model that uses the BERT architecture and was trained on large amounts of clinical documents, such as electronic health records (EHRs). Similarly, PubMed BERT is a BERT-architecture language model pre-trained on literature from the PubMed database. For more details, please look up the individual BERT variants by name.

"""


 
# Load the embedded datasets from local disk into the Python environment as pandas DataFrames
import pandas  # pandas is required for loading the .csv files from local disk


# Load the vectors of the Phase 2 trials' 'Study Description' texts embedded using the BioBERT model.
biovec = pandas.read_csv(
    "/Local_disk_path/BioBERT_vectors.csv",
    header=None)



# Load the vectors of the Phase 2 trials' 'Study Description' texts embedded using the Science BERT model.
scivec = pandas.read_csv(
    "/Local_disk_path/Science_BERT_vectors.csv",
    header=None)


# Load the vectors of the Phase 2 trials' 'Study Description' texts embedded using the Sentence BERT model.
sentvec = pandas.read_csv(
    "/Local_disk_path/Sentence_BERT_vectors.csv",
    header=None)


# Load the vectors of the Phase 2 trials' 'Study Description' texts embedded using the PubMed BERT model.
pmvec = pandas.read_csv(
    "/Local_disk_path/PubMed_BERT_vectors.csv",
    header=None)


# Load the vectors of the Phase 2 trials' 'Study Description' texts embedded using the Clinical BERT model.
clinvec = pandas.read_csv(
    "/Local_disk_path/Clinical_BERT_vectors.csv",
    header=None)
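
# Optional sanity check (a minimal sketch, not part of the original analysis): each vector set
# is expected to hold one row per Phase 2 trial, so all five sets should have the same number
# of rows. The helper name 'row_counts_match' is hypothetical; it accepts any objects that
# support len(), including pandas DataFrames.
def row_counts_match(vector_sets):
    """Return True if every vector set in the dict has the same number of rows."""
    counts = {name: len(vectors) for name, vectors in vector_sets.items()}
    return len(set(counts.values())) <= 1


# Example usage with the DataFrames loaded above:
# row_counts_match({"BioBERT": biovec, "Science BERT": scivec, "Sentence BERT": sentvec,
#                   "PubMed BERT": pmvec, "Clinical BERT": clinvec})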


# Import the module for data split
from sklearn.model_selection import train_test_split 


# Split the dataset for machine learning training and validation
# NOTE: 'labels_y' (the success/failure labels, one per row of the vector set) is not defined
# in this script and must be loaded before this call.
X_train, X_test, y_train, y_test = train_test_split(

    biovec,  # Test the BioBERT embedded vectors of the Phase 2 clinical trial 'Study Description' text. Replace this argument with another vector set ('scivec', 'clinvec', 'pmvec' or 'sentvec') to test another BERT variant.

    # scivec,  # Test the Science BERT embedded vectors of the Phase 2 clinical trial 'Study Description' text

    # clinvec,  # Test the Clinical BERT embedded vectors of the Phase 2 clinical trial 'Study Description' text

    # pmvec,  # Test the PubMed BERT embedded vectors of the Phase 2 clinical trial 'Study Description' text

    # sentvec,  # Test the Sentence BERT embedded vectors of the Phase 2 clinical trial 'Study Description' text

    # Specify the label set
    labels_y,
    test_size=0.3,
    random_state=99,
    stratify=labels_y  # "stratify=labels_y" keeps the original ratio of positive/negative class instances in both splits
)
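
# Optional check of the stratified split (a minimal sketch, assuming the labels form a 1-D
# sequence of hashable values such as 0/1). With stratify=labels_y, the class proportions in
# y_train and y_test should be close to those in labels_y. The helper name 'class_fractions'
# is hypothetical.
from collections import Counter


def class_fractions(labels):
    """Return a dict mapping each class label to its fraction of all labels."""
    labels = list(labels)
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


# Example usage:
# print(class_fractions(labels_y), class_fractions(y_train), class_fractions(y_test))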



from lazypredict.Supervised import LazyClassifier

clf = LazyClassifier(  # Specify arguments for the classifier
    verbose=110,
    ignore_warnings=True,
    custom_metric=None,
    predictions=True,
    random_state=99,
    classifiers="all")

# Train and validate the models using the specified dataset
models, predictions = clf.fit(
    X_train,
    X_test,
    y_train,
    y_test)


# Print the resulting metrics for the training/validation split
print(models) 
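
# A minimal sketch of benchmarking every vector set in one run, instead of editing the
# train_test_split call by hand for each BERT variant. 'benchmark_all' and 'evaluate' are
# hypothetical names; 'evaluate' is any callable that takes (vectors, labels) and returns
# a result, e.g. the metrics table produced by LazyClassifier.
def benchmark_all(vector_sets, labels, evaluate):
    """Apply evaluate(vectors, labels) to each named vector set and collect results by name."""
    results = {}
    for name, vectors in vector_sets.items():
        results[name] = evaluate(vectors, labels)
    return results


# Example usage (assumes 'labels_y' and the five DataFrames are defined as above):
# def evaluate(vectors, labels):
#     X_tr, X_te, y_tr, y_te = train_test_split(
#         vectors, labels, test_size=0.3, random_state=99, stratify=labels)
#     scores, _ = LazyClassifier(ignore_warnings=True, predictions=True,
#                                random_state=99).fit(X_tr, X_te, y_tr, y_te)
#     return scores
#
# all_results = benchmark_all({"BioBERT": biovec, "Science BERT": scivec}, labels_y, evaluate)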



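# Summary of the results recorded in the docstring below (a minimal sketch): the top-ranked
# row of each table, i.e. the model with the highest Balanced Accuracy for each embedding,
# copied by hand from those tables. The variable names are hypothetical. The differences
# (0.81 to 0.83) are small and come from a single split with random_state=99, so they
# should be read with caution.
top_by_embedding = {
    "BioBERT": ("LinearSVC", 0.81),
    "Science BERT": ("LogisticRegression", 0.82),
    "Clinical BERT": ("LinearDiscriminantAnalysis", 0.82),
    "PubMed BERT": ("LogisticRegression", 0.83),
    "Sentence BERT": ("LogisticRegression", 0.82),
}

# Pick the embedding whose top-ranked model has the highest Balanced Accuracy
best_embedding = max(top_by_embedding, key=lambda name: top_by_embedding[name][1])
print(best_embedding, top_by_embedding[best_embedding])
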
""" Above machine learning test results were as shown below

################ Results of machine learning classification using the BioBERT embedded feature vectors of the Phase 2 clinical trial 'Study Description' text

                                  Accuracy  Balanced Accuracy  ROC AUC  F1 Score  Time Taken
Model
LinearSVC                          0.83               0.81     0.81      0.83        0.31
LinearDiscriminantAnalysis         0.83               0.81     0.81      0.83        0.13
LogisticRegression                 0.82               0.80     0.80      0.82        0.06
RidgeClassifierCV                  0.82               0.80     0.80      0.82        0.10
RidgeClassifier                    0.82               0.79     0.79      0.82        0.03
AdaBoostClassifier                 0.81               0.78     0.78      0.81        1.36
PassiveAggressiveClassifier        0.81               0.78     0.78      0.80        0.04
XGBClassifier                      0.81               0.78     0.78      0.81        0.82
Perceptron                         0.79               0.78     0.78      0.79        0.03
RandomForestClassifier             0.82               0.78     0.78      0.82        0.63
LGBMClassifier                     0.81               0.77     0.77      0.80        0.33
BaggingClassifier                  0.80               0.77     0.77      0.80        1.48
NuSVC                              0.81               0.77     0.77      0.80        0.08
ExtraTreeClassifier                0.80               0.76     0.76      0.79        0.03
DecisionTreeClassifier             0.79               0.76     0.76      0.79        0.19
SGDClassifier                      0.79               0.75     0.75      0.78        0.03
QuadraticDiscriminantAnalysis      0.75               0.75     0.75      0.75        0.14
LabelSpreading                     0.81               0.74     0.74      0.79        0.05
LabelPropagation                   0.80               0.74     0.74      0.79        0.05
ExtraTreesClassifier               0.79               0.74     0.74      0.78        0.26
SVC                                0.75               0.68     0.68      0.74        0.08
CalibratedClassifierCV             0.75               0.68     0.68      0.73        1.21
GaussianNB                         0.64               0.66     0.66      0.64        0.03
KNeighborsClassifier               0.72               0.65     0.65      0.71        1.64
BernoulliNB                        0.68               0.64     0.64      0.67        0.03
NearestCentroid                    0.64               0.63     0.63      0.65        0.03
DummyClassifier                    0.66               0.50     0.50      0.52        0.02




#################### Results of machine learning classification using the Science BERT embedded feature vectors of the Phase 2 clinical trial 'Study Description' text

                                  Accuracy  Balanced Accuracy  ROC AUC  F1 Score  Time Taken
Model
LogisticRegression                 0.83               0.82     0.82      0.83        0.06
RidgeClassifierCV                  0.82               0.80     0.80      0.82        0.11
LinearDiscriminantAnalysis         0.81               0.80     0.80      0.81        0.13
RidgeClassifier                    0.81               0.80     0.80      0.81        0.04
Perceptron                         0.82               0.80     0.80      0.82        0.03
RandomForestClassifier             0.83               0.79     0.79      0.83        0.71
PassiveAggressiveClassifier        0.81               0.78     0.78      0.80        0.04
BaggingClassifier                  0.81               0.78     0.78      0.81        1.56
LinearSVC                          0.80               0.78     0.78      0.80        0.33
LGBMClassifier                     0.81               0.78     0.78      0.81        0.34
XGBClassifier                      0.82               0.77     0.77      0.81        0.74
AdaBoostClassifier                 0.81               0.77     0.77      0.80        1.37
DecisionTreeClassifier             0.81               0.77     0.77      0.80        0.19
NuSVC                              0.81               0.77     0.77      0.80        0.08
ExtraTreesClassifier               0.81               0.76     0.76      0.80        0.26
CalibratedClassifierCV             0.79               0.75     0.75      0.78        1.22
QuadraticDiscriminantAnalysis      0.75               0.75     0.75      0.76        0.13
SGDClassifier                      0.77               0.74     0.74      0.77        0.03
LabelPropagation                   0.81               0.74     0.74      0.79        0.04
LabelSpreading                     0.81               0.74     0.74      0.79        0.05
ExtraTreeClassifier                0.77               0.73     0.73      0.76        0.03
SVC                                0.76               0.68     0.68      0.74        0.08
KNeighborsClassifier               0.73               0.65     0.65      0.71        0.04
NearestCentroid                    0.66               0.63     0.63      0.66        0.03
GaussianNB                         0.66               0.62     0.62      0.66        0.03
BernoulliNB                        0.63               0.61     0.61      0.63        0.03
DummyClassifier                    0.66               0.50     0.50      0.52        0.03






#################### Results of machine learning classification using the Clinical BERT embedded feature vectors of the Phase 2 clinical trial 'Study Description' text

                                  Accuracy  Balanced Accuracy  ROC AUC  F1 Score  Time Taken
Model
LinearDiscriminantAnalysis         0.83               0.82     0.82      0.83        0.14
RidgeClassifierCV                  0.83               0.81     0.81      0.83        0.11
SGDClassifier                      0.83               0.80     0.80      0.83        0.04
LinearSVC                          0.82               0.80     0.80      0.82        0.32
LogisticRegression                 0.82               0.80     0.80      0.82        0.06
RidgeClassifier                    0.81               0.80     0.80      0.81        0.04
LGBMClassifier                     0.82               0.79     0.79      0.82        0.36
PassiveAggressiveClassifier        0.81               0.79     0.79      0.81        0.04
XGBClassifier                      0.83               0.79     0.79      0.82        0.75
BaggingClassifier                  0.81               0.78     0.78      0.81        1.40
RandomForestClassifier             0.82               0.77     0.77      0.81        0.64
NuSVC                              0.81               0.76     0.76      0.80        0.08
ExtraTreeClassifier                0.80               0.76     0.76      0.80        0.03
ExtraTreesClassifier               0.81               0.75     0.75      0.79        0.26
DecisionTreeClassifier             0.79               0.74     0.74      0.78        0.27
QuadraticDiscriminantAnalysis      0.74               0.74     0.74      0.75        0.15
LabelSpreading                     0.81               0.74     0.74      0.79        0.05
LabelPropagation                   0.81               0.74     0.74      0.79        0.04
SVC                                0.79               0.72     0.72      0.77        0.08
AdaBoostClassifier                 0.75               0.71     0.71      0.75        1.37
CalibratedClassifierCV             0.77               0.69     0.69      0.75        1.17
Perceptron                         0.74               0.69     0.69      0.74        0.03
GaussianNB                         0.71               0.67     0.67      0.71        0.03
KNeighborsClassifier               0.73               0.66     0.66      0.72        0.04
BernoulliNB                        0.66               0.62     0.62      0.66        0.03
NearestCentroid                    0.65               0.60     0.60      0.64        0.03
DummyClassifier                    0.66               0.50     0.50      0.52        0.02






#################### Results of machine learning classification using the PubMed BERT embedded feature vectors of the Phase 2 clinical trial 'Study Description' text

                                  Accuracy  Balanced Accuracy  ROC AUC  F1 Score  Time Taken
Model
LogisticRegression                 0.84               0.83     0.83      0.84        0.06
LinearDiscriminantAnalysis         0.83               0.82     0.82      0.83        0.12
RidgeClassifierCV                  0.83               0.82     0.82      0.83        0.11
XGBClassifier                      0.85               0.82     0.82      0.85        0.72
LinearSVC                          0.83               0.82     0.82      0.83        0.29
RidgeClassifier                    0.82               0.81     0.81      0.82        0.04
Perceptron                         0.82               0.80     0.80      0.82        0.03
LGBMClassifier                     0.83               0.80     0.80      0.83        0.43
PassiveAggressiveClassifier        0.81               0.79     0.79      0.81        0.04
SGDClassifier                      0.82               0.79     0.79      0.82        0.03
CalibratedClassifierCV             0.82               0.78     0.78      0.82        1.15
RandomForestClassifier             0.82               0.77     0.77      0.81        0.66
BaggingClassifier                  0.80               0.76     0.76      0.79        1.39
ExtraTreesClassifier               0.82               0.76     0.76      0.81        0.26
DecisionTreeClassifier             0.79               0.76     0.76      0.79        0.29
ExtraTreeClassifier                0.79               0.75     0.75      0.79        0.03
QuadraticDiscriminantAnalysis      0.75               0.75     0.75      0.76        0.15
AdaBoostClassifier                 0.79               0.75     0.75      0.79        1.38
NuSVC                              0.79               0.75     0.75      0.79        0.08
LabelPropagation                   0.80               0.74     0.74      0.79        0.04
LabelSpreading                     0.80               0.74     0.74      0.79        0.04
SVC                                0.80               0.73     0.73      0.78        0.08
KNeighborsClassifier               0.75               0.69     0.69      0.74        0.04
GaussianNB                         0.67               0.65     0.65      0.68        0.03
BernoulliNB                        0.67               0.65     0.65      0.68        0.03
NearestCentroid                    0.63               0.62     0.62      0.64        0.03
DummyClassifier                    0.66               0.50     0.50      0.52        0.02





#################### Results of machine learning classification using the Sentence BERT embedded feature vectors of the Phase 2 clinical trial 'Study Description' text

                                  Accuracy  Balanced Accuracy  ROC AUC  F1 Score  Time Taken
Model
LogisticRegression                 0.84               0.82     0.82      0.84        0.03
LinearDiscriminantAnalysis         0.82               0.82     0.82      0.83        0.07
LinearSVC                          0.83               0.82     0.82      0.83        0.85
RidgeClassifier                    0.83               0.81     0.81      0.83        0.02
BaggingClassifier                  0.83               0.80     0.80      0.83        0.82
LGBMClassifier                     0.83               0.80     0.80      0.83        0.17
RidgeClassifierCV                  0.83               0.80     0.80      0.83        0.06
PassiveAggressiveClassifier        0.82               0.79     0.79      0.82        0.03
NuSVC                              0.83               0.78     0.78      0.82        0.04
ExtraTreeClassifier                0.80               0.76     0.76      0.79        0.02
SGDClassifier                      0.79               0.76     0.76      0.79        0.02
RandomForestClassifier             0.81               0.76     0.76      0.80        0.47
XGBClassifier                      0.79               0.76     0.76      0.79        0.43
DecisionTreeClassifier             0.78               0.75     0.75      0.78        0.14
QuadraticDiscriminantAnalysis      0.75               0.75     0.75      0.76        0.08
SVC                                0.81               0.75     0.75      0.80        0.04
LabelSpreading                     0.81               0.74     0.74      0.79        0.03
ExtraTreesClassifier               0.80               0.74     0.74      0.79        0.19
Perceptron                         0.78               0.74     0.74      0.77        0.02
BernoulliNB                        0.77               0.74     0.74      0.76        0.02
LabelPropagation                   0.80               0.74     0.74      0.79        0.03
AdaBoostClassifier                 0.78               0.73     0.73      0.77        0.72
GaussianNB                         0.74               0.70     0.70      0.74        0.02
KNeighborsClassifier               0.77               0.70     0.70      0.76        0.03
NearestCentroid                    0.67               0.66     0.66      0.68        0.02
CalibratedClassifierCV             0.70               0.57     0.57      0.62        2.69
DummyClassifier                    0.66               0.50     0.50      0.52        0.02
"""